ABSTRACT


We discuss recent theorems proving that artificial neural networks are 
capable of approximating an arbitrary mapping and its derivatives as 
accurately as desired. This fact forms the basis for further results 
establishing the learnability of the desired approximations, using results 
from non-parametric statistics. These results have potential applications in 
robotics, chaotic dynamics, control, and sensitivity analysis (physics, 
chemistry, and engineering). We discuss an example involving learning the 
transfer function and its derivatives for a chaotic map. 
Jordan (1989), "Generic Constraints on Underspecified Target Trajectories," 
Proceedings IJCNN, Washington D.C.: 


The Jacobian matrix dz/dx ... is the matrix that relates small changes in the 
controller output to small changes in the task space results and cannot be 
assumed to be available a priori, or provided by the environment. However, 
all of the derivatives in the matrix are forward derivatives. They are easily 
obtained by differentiation if a forward model is available. The forward 
model itself must be learned, but this can be achieved directly by system 
identification. Once the model is accurate over a particular domain, its 
derivatives provide a learning operator that allows the system to convert 
errors in task space into errors in articulatory space and thereby change the 
controller. 
* We are indebted to Angelo Melino for pressing us on the issue addressed here and to 
the referees for numerous helpful suggestions. White’s participation was supported by 
NSF Grant SES-8806990. 
ABSTRACT


We give conditions ensuring that multilayer feedforward networks with as few as a 
single hidden layer and an appropriately smooth hidden layer activation function are 
capable of arbitrarily accurate approximation to an arbitrary function and its derivatives. 
In fact, these networks can approximate functions that are not differentiable in the 
classical sense, but possess only a generalized derivative, as is the case for certain 
piecewise differentiable functions. The conditions imposed on the hidden layer 
activation function are relatively mild; the conditions imposed on the domain of the 
function to be approximated have practical implications. Our approximation results 
provide a previously missing theoretical justification for the use of multilayer 
feedforward networks in applications requiring simultaneous approximation of a function 
and its derivatives. 




Relevant Application Areas: 


1. Robotics 


2. Chaotic Dynamics 


3. Control 


4. Sensitivity Analysis (Physics, Chemistry, Engineering) 



Intuition suggests that networks having smooth hidden layer activation functions 
ought to have output function derivatives that will approximate the derivatives of an 
unknown mapping. However, the justification for this intuition is not obvious. Consider 
the class of single hidden layer feedforward networks having network output functions 
belonging to the set 

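$$\Sigma(G) \equiv \Big\{ g : \mathbb{R}^r \to \mathbb{R} \;\Big|\; g(x) = \sum_{j=1}^{q} \beta_j G(\tilde{x}^T \gamma_j);\ x \in \mathbb{R}^r,\ \beta_j \in \mathbb{R},\ \gamma_j \in \mathbb{R}^{r+1},\ j = 1, \ldots, q,\ q \in \mathbb{N} \Big\},$$

where $x$ represents an $r$ vector of network inputs ($r \in \mathbb{N} \equiv \{1, 2, \ldots\}$), $\tilde{x} \equiv (1, x^T)^T$ 
(the superscript $T$ denotes transposition), $\beta_j$ represents hidden to output layer weights 
and $\gamma_j$ represents input to hidden layer weights, $j = 1, \ldots, q$, where $q$ is the number of 
hidden units, and $G$ is a given hidden unit activation function. The first partial 
derivatives of the network output function are given by 

$$\partial g(x) / \partial x_i = \sum_{j=1}^{q} \beta_j \gamma_{ji} \, DG(\tilde{x}^T \gamma_j), \quad i = 1, \ldots, r,$$

where $x_i$ is the $i$th component of $x$, $\gamma_{ji}$ is the $i$th component of $\gamma_j$, $i = 1, \ldots, r$ ($\gamma_{j0}$ is the 
input layer bias to hidden unit $j$), and $DG$ denotes the first derivative of $G$. 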
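
The output and first-derivative formulas above can be checked numerically. The following is a minimal sketch, assuming a logistic activation and arbitrary illustrative weights (the names `net_output`, `net_gradient`, `beta`, and `gamma` are hypothetical and not taken from the source); it compares the analytic partial derivatives with central finite differences.

```python
import math

def logistic(u):
    """Assumed activation G (the logistic squasher)."""
    return 1.0 / (1.0 + math.exp(-u))

def d_logistic(u):
    """First derivative DG of the logistic activation."""
    s = logistic(u)
    return s * (1.0 - s)

def pre_activation(x, g):
    """Bias g[0] plus the weighted sum of the inputs for one hidden unit."""
    return g[0] + sum(gi * xi for gi, xi in zip(g[1:], x))

def net_output(x, beta, gamma):
    """Network output: sum over hidden units of beta_j * G(pre-activation)."""
    return sum(b * logistic(pre_activation(x, g)) for b, g in zip(beta, gamma))

def net_gradient(x, beta, gamma):
    """Partial derivatives: sum over j of beta_j * gamma_ji * DG(pre-activation)."""
    grad = [0.0] * len(x)
    for b, g in zip(beta, gamma):
        dG = d_logistic(pre_activation(x, g))
        for i in range(len(x)):
            grad[i] += b * g[i + 1] * dG
    return grad

# Illustrative weights only: q = 3 hidden units, r = 2 inputs.
beta = [0.7, -1.2, 0.4]
gamma = [[0.1, 0.5, -0.3], [-0.2, 1.0, 0.8], [0.3, -0.6, 0.2]]
x = [0.25, -0.5]

grad = net_gradient(x, beta, gamma)

# Central finite differences as an independent check.
h = 1e-6
fd = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    fd.append((net_output(xp, beta, gamma) - net_output(xm, beta, gamma)) / (2 * h))

max_err = max(abs(a - b) for a, b in zip(grad, fd))
```

Agreement between `grad` and `fd` only confirms that the derivative formula matches the output formula for a fixed network; it says nothing about how well either approximates an unknown mapping, which is the subject of the results that follow.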
Figure 2. Single Hidden Layer Feedforward Network 
Outline: 


1. Mathematical Background 

2. Approximation Results 

3. Learning Results 

4. Example: Learning Chaotic Map 
1. MATHEMATICAL BACKGROUND 


Let $U$ be an open subset of $\mathbb{R}^r$, and let $C(U)$ be the set of all functions continuous 
on $U$. Let $\alpha$ be an $r$-tuple $\alpha = (\alpha_1, \ldots, \alpha_r)^T$ of non-negative integers (a "multi-index"). 
If $x$ belongs to $\mathbb{R}^r$, let $x^\alpha \equiv x_1^{\alpha_1} \cdot \ldots \cdot x_r^{\alpha_r}$. Denote by $D^\alpha$ the partial derivative 

$$\partial^{|\alpha|} / \partial x^\alpha \equiv \partial^{|\alpha|} / (\partial x_1^{\alpha_1} \, \partial x_2^{\alpha_2} \cdots \partial x_r^{\alpha_r})$$

of order $|\alpha| \equiv \alpha_1 + \alpha_2 + \ldots + \alpha_r$. For non-negative integers $m$, we define 

$C^m(U) \equiv \{ f \in C(U) : D^\alpha f \in C(U) \text{ for all } \alpha,\ |\alpha| \le m \}$ and $C^\infty(U) \equiv \bigcap_{m \ge 1} C^m(U)$. 
We let $D^0$ be the identity, so that $C^0(U) = C(U)$. Thus, the functions in $C^m(U)$ have 
continuous derivatives up to order $m$ on $U$, while the functions in $C^\infty(U)$ have 
continuous derivatives on $U$ of every order. We shall be interested in approximating 
elements of $C^m(U)$ using feedforward networks. When $U \ne \mathbb{R}^r$, the fact that network 
output functions (elements of $\Sigma(G)$) will belong to $C^m(\mathbb{R}^r)$ necessitates considering 
their restriction to $U$, written $g|_U$ for $g$ in $\Sigma(G)$. (Recall that $g|_U(x) = g(x)$ for $x$ in $U$ 
and is not defined for $x$ not in $U$, thus $g|_U \in C^m(U)$, as desired.) 
DEFINITION 2.1: Let $U$ be a subset of $\mathbb{R}^r$, let $S$ be a collection of functions $f$: 
$U \to \mathbb{R}$ and let $\rho$ be a metric on $S$. For any $g$ in $\Sigma(G)$ (recall $g : \mathbb{R}^r \to \mathbb{R}$) define 
the restriction of $g$ to $U$, $g|_U$, as $g|_U(x) = g(x)$ for $x$ in $U$, $g|_U(x)$ unspecified for $x$ 
not in $U$. 

Suppose that for any $f$ in $S$ and $\varepsilon > 0$ there exists $g$ in $\Sigma(G)$ such that 
$\rho(f, g|_U) < \varepsilon$. Then we say that $\Sigma(G)$ contains a subset $\rho$-dense in $S$. If in addition 
$g|_U$ belongs to $S$ for every $g$ in $\Sigma(G)$, we say that $\Sigma(G)$ is $\rho$-dense in $S$. □ 

DEFINITION 2.2: Let $m, l \in \{0\} \cup \mathbb{N}$, $0 \le m \le l$, and $U \subset \mathbb{R}^r$ be given, and let 
$S \subset C^l(U)$. Suppose that for any $f$ in $S$, compact $K \subset U$ and $\varepsilon > 0$ there exists $g$ in 
$\Sigma(G)$ such that $\max_{|\alpha| \le m} \sup_{x \in K} | D^\alpha f(x) - D^\alpha g(x) | < \varepsilon$. Then we say that 
$\Sigma(G)$ is $m$-uniformly dense on compacta in $S$. □ 

When $\Sigma(G)$ is $m$-uniformly dense on compacta in $S$, then no matter how we choose 
an $f$ in $S$, a compact subset $K$ of $U$, or the accuracy of approximation $\varepsilon > 0$, we can 
always find a single hidden layer feedforward network having output function $g$ (in 
$\Sigma(G)$) with all derivatives of $g|_U$ on $K$ up to order $m$ lying within $\varepsilon$ of those of $f$ on $K$. 
This is a strong and very desirable approximation property. 
The space $L_p(U, \mu)$ is the collection of all measurable functions $f$ such that 
$\|f\|_{p,U,\mu} \equiv \left[ \int_U |f|^p \, d\mu \right]^{1/p} < \infty$, $1 \le p < \infty$, where the integral is defined in the 
sense of Lebesgue. When $\mu = \lambda$ we may write either $\int_U f \, d\lambda$ or $\int_U f(x) \, dx$ to denote 
the same integral. We measure the distance between two functions $f$ and $g$ belonging to 
$L_p(U, \mu)$ in terms of the metric $\rho_{p,U,\mu}(f, g) \equiv \|f - g\|_{p,U,\mu}$. Two functions that differ 
only on sets of $\mu$-measure zero have $\rho_{p,U,\mu}(f, g) = 0$. We shall not distinguish between 
such functions. 

The first Sobolev space we consider is denoted $S_p^m(U, \mu)$, defined as the collection 
of all functions $f$ in $C^m(U)$ such that $\|D^\alpha f\|_{p,U,\mu} < \infty$ for all $|\alpha| \le m$. We define 
the Sobolev norm $\|f\|_{m,p,U,\mu} \equiv \left( \sum_{|\alpha| \le m} \|D^\alpha f\|_{p,U,\mu}^p \right)^{1/p}$. The Sobolev metric is 

$$\rho_{p,\mu}^m(f, g) \equiv \|f - g\|_{m,p,U,\mu}, \quad f, g \in S_p^m(U, \mu).$$

Note that $\rho_{p,\mu}^m$ depends implicitly on $U$, but we suppress this dependence for notational 
convenience. The Sobolev metric explicitly takes into account distances between 
derivatives. Two functions in $S_p^m(U, \mu)$ are close in the Sobolev metric $\rho_{p,\mu}^m$ when all 
derivatives of order $0 \le |\alpha| \le m$ are close in $L_p$ metric. 
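
To illustrate why the Sobolev metric is more demanding than the plain L_p metric, the sketch below compares the two distances for a pair of functions that differ by a small, rapidly oscillating term. It is only an illustration under stated assumptions: one input dimension, p = 2, m = 1, U = (0, pi), Lebesgue measure, and a simple trapezoid rule; the helper names `lp_norm` and `sobolev_norm` are hypothetical.

```python
import math

def lp_norm(h, a, b, p=2, n=20000):
    """Trapezoid-rule approximation of the L_p norm of h on (a, b)."""
    step = (b - a) / n
    total = 0.5 * (abs(h(a)) ** p + abs(h(b)) ** p)
    for k in range(1, n):
        total += abs(h(a + k * step)) ** p
    return (total * step) ** (1.0 / p)

def sobolev_norm(derivatives, a, b, p=2):
    """Sobolev norm from a list [h, Dh, ...] of derivatives up to order m."""
    return sum(lp_norm(d, a, b, p) ** p for d in derivatives) ** (1.0 / p)

# Difference f - g between two functions that are uniformly close in value
# but whose first derivatives differ by an oscillation of amplitude 1.
def diff(x):
    return 0.01 * math.sin(100.0 * x)

def d_diff(x):
    return math.cos(100.0 * x)

l2_distance = lp_norm(diff, 0.0, math.pi)                     # order-0 distance
sobolev_distance = sobolev_norm([diff, d_diff], 0.0, math.pi)  # m = 1 distance
```

Here `l2_distance` is about a hundredth of `sobolev_distance`: the functions nearly coincide in L_2, yet they are far apart once first derivatives are taken into account.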
We also consider the Sobolev spaces 

$$W_p^m(U) \equiv \{ f \in L_{1,\mathrm{loc}}(U) \mid \partial^\alpha f \in L_p(U, \lambda),\ 0 \le |\alpha| \le m \}.$$

This is the collection of all functions having generalized derivatives belonging to 
$L_p(U, \lambda)$ of order up to $m$. Consequently, $W_p^m(U)$ includes $S_p^m(U, \lambda)$, as well as 
functions that do not have derivatives in the classical sense, such as piecewise 
differentiable functions. 

The norm on $W_p^m(U)$ generalizes that on $S_p^m(U, \lambda)$; we write it as 

$$\|f\|_{m,p,U} \equiv \Big( \sum_{|\alpha| \le m} \|\partial^\alpha f\|_{p,U,\lambda}^p \Big)^{1/p}, \quad f \in W_p^m(U).$$

For the metric on $W_p^m(U)$ we suppress the dependence on $U$ and write 

$$\rho_p^m(f, g) \equiv \|f - g\|_{m,p,U}, \quad f, g \in W_p^m(U).$$

Two functions are close in the Sobolev space $W_p^m(U)$ if all generalized derivatives are 
close in $L_p(U, \lambda)$ distance. 
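Our results make fundamental use of one last function space, the space $C_\downarrow^\infty(\mathbb{R}^r)$ 
of rapidly decreasing functions in $C^\infty(\mathbb{R}^r)$. $C_\downarrow^\infty(\mathbb{R}^r)$ is defined as the set of all 
functions in $C^\infty(\mathbb{R}^r)$ such that for all multi-indices $\alpha$ and $\beta$, $x^\beta D^\alpha f(x) \to 0$ as 
$|x| \to \infty$, where $x^\beta \equiv x_1^{\beta_1} \cdots x_r^{\beta_r}$ and $|x| \equiv \max_{1 \le i \le r} |x_i|$. Note that 

$$C_0^\infty(\mathbb{R}^r) \subset C_\downarrow^\infty(\mathbb{R}^r).$$

Desired results: 

1.) $\Sigma(G)$ is $m$-uniformly dense on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$, $S_p^m(U, \lambda)$ 

2.) $\Sigma(G)$ is $\rho_{p,\mu}^m$-dense in $S_p^m(\mathbb{R}^r, \mu)$ 

3.) $\Sigma(G)$ is $\rho_p^m$-dense in $W_p^m(U)$ 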
2. APPROXIMATION RESULTS 


THEOREM 3.1: Let $G \ne 0$ belong to $S_1^m(\mathbb{R}, \lambda)$ for some integer $m \ge 0$. Then $\Sigma(G)$ 
is $m$-uniformly dense on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$. □ 

DEFINITION 3.2: Let $l \in \{0\} \cup \mathbb{N}$ be given. $G$ is $l$-finite if $G \in C^l(\mathbb{R})$ and 
$0 < \int |D^l G| \, d\lambda < \infty$. □ 

LEMMA 3.3: If $G$ is $l$-finite then for all $0 \le m \le l$ there exists $H \in S_1^m(\mathbb{R}, \lambda)$, $H \ne 0$, 
such that $\Sigma(H) \subset \Sigma(G)$. □ 

$l$-finite activation functions $G$ with $\int D^l G \, d\lambda \ne 0$ have $\int |D^m G| \, d\lambda = \infty$ for all $m < l$, 
and for $m > l$ all $l$-finite activation functions $G$ have $\int D^m G \, d\lambda = 0$ (provided $D^m G$ 
exists). 

It is informative to examine cases not satisfying the conditions of the theorems. For 
example, if $G = \sin$ then $G \in C^\infty(\mathbb{R})$, but for all $l$, $\int |D^l G| \, d\lambda = \infty$. If $G$ is a 
polynomial of degree $m$ then again $G \in C^\infty(\mathbb{R})$, but for $l \le m$ we have 
$\int |D^l G| \, d\lambda = \infty$, although $\int |D^l G| \, d\lambda = 0$ for $l > m$. Consequently, neither 
trigonometric functions nor polynomials are $l$-finite. 
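
The l-finiteness condition can be explored numerically. The sketch below is an illustration only, under the assumption that G is the logistic function 1/(1 + exp(-u)); it integrates |G|, |DG|, and |cos| (the first derivative of sin) over growing intervals [-L, L] with a trapezoid rule. The helper `trapezoid` and the interval limits are hypothetical choices, and a numerical integral over a finite interval can only suggest, not prove, convergence or divergence.

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def d_logistic(u):
    s = logistic(u)
    return s * (1.0 - s)

def trapezoid(f, a, b, n=20000):
    """Composite trapezoid rule for the integral of f over [a, b]."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * step)
    return total * step

limits = [50.0, 100.0, 200.0]  # half-widths L of the integration interval

# Integral of |DG|: settles at 1, consistent with the logistic being 1-finite.
mass_dG = [trapezoid(lambda u: abs(d_logistic(u)), -L, L) for L in limits]

# Integral of |G|: grows like L, so the logistic is not 0-finite.
mass_G = [trapezoid(lambda u: abs(logistic(u)), -L, L) for L in limits]

# Integral of |D sin| = |cos|: keeps growing with L, so sin is not 1-finite.
mass_cos = [trapezoid(lambda u: abs(math.cos(u)), -L, L) for L in limits]
```

The stable value of `mass_dG` against the steadily growing `mass_G` and `mass_cos` mirrors the distinction drawn in the text between admissible squashing functions and trigonometric activations.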
COROLLARY 3.4: If $G$ is $l$-finite, then for all $0 \le m \le l$, $\Sigma(G)$ is $m$-uniformly dense 
on compacta in $C_\downarrow^\infty(\mathbb{R}^r)$. □ 

COROLLARY 3.5: If $G$ is $l$-finite, $0 \le m \le l$, and $U$ is an open subset of $\mathbb{R}^r$ then 
$\Sigma(G)$ is $m$-uniformly dense on compacta in $S_p^m(U, \lambda)$ for $1 \le p < \infty$. □ 

COROLLARY 3.6: If $G$ is $l$-finite and $\mu$ is compactly supported, then for all $0 \le m \le l$, 
$\Sigma(G) \subset S_p^m(\mathbb{R}^r, \mu)$ and $\Sigma(G)$ is $\rho_{p,\mu}^m$-dense in $S_p^m(\mathbb{R}^r, \mu)$. □ 

COROLLARY 3.8: If $G$ is $l$-finite, $0 \le m \le l$, $U$ is an open bounded subset of $\mathbb{R}^r$ and 
$C_0^\infty(\mathbb{R}^r)$ is $\rho_p^m$-dense in $W_p^m(U)$ then $\Sigma(G)$ is also $\rho_p^m$-dense in $W_p^m(U)$. □ 

These results rigorously establish that sufficiently complex multilayer feedforward 
networks with as few as a single hidden layer are capable of arbitrarily accurate 
approximation to an unknown mapping and its (generalized) derivatives in a variety of 
precise senses. The conditions imposed on G are relatively mild; the conditions required 
of U have practical implications. 
Figure 1. Feedforward Network 

Legend: input unit; multiplication unit; activation unit G(·); addition unit. 

Note: biases not shown 
Figure 2. Derivative Network 

Legend: input unit; multiplication unit; activation unit G(·); addition unit; 
activation derivative unit DG(·). 

Note: biases not shown 
ABSTRACT


Recently, multiple input, single output, single hidden layer, feedforward 
neural networks have been shown to be capable of approximating a nonlinear map 
and its partial derivatives. Specifically, neural nets have been shown to be dense 
in various Sobolev spaces (Hornik, Stinchcombe and White, 1989). Building 
upon this result, we show that a net can be trained so that the map and its 
derivatives are learned. Specifically, we use a result of Gallant (1987b) to show 
that least squares and similar estimates are strongly consistent in Sobolev norm 
provided the number of hidden units and the size of the training set increase 
together. We illustrate these results by an application to the inverse problem of 
chaotic dynamics: recovery of a nonlinear map from a time series of iterates. 
These results extend automatically to nets that embed the single hidden layer, 
feedforward network as a special case. 





3. LEARNING RESULTS 


SETUP. We consider a single hidden layer feedforward network having network 
output function 

$$g_K(x, \theta) = \sum_{j=1}^{K} \beta_j G(x^T \gamma_j)$$

where $x$ represents an $r \times 1$ vector of network inputs (including a "bias unit"), $\beta_j$ 
represents hidden to output layer weights, $\gamma_j$ represents input to hidden layer 
weights, $K$ is the number of hidden units, 

$$\theta' = (\beta_1, \gamma_1', \beta_2, \gamma_2', \ldots, \beta_K, \gamma_K'),$$

and $G$ is the hidden unit activation function. 

We assume that the network is trained using data $\{y_t, x_t\}$ generated 
according to 

$$y_t = g^*(x_t) + e_t, \quad t = 1, 2, \ldots, n.$$

$x_t$ denotes the observed input and $e_t$ denotes random noise. The number $K_n$ of 
hidden units employed depends on the size $n$ of the training set. The network is 
trained by finding $g_{K_n}(x, \hat{\theta})$ that minimizes 

$$s_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \Big[ y_t - \sum_{j=1}^{K_n} \beta_j G(x_t^T \gamma_j) \Big]^2$$

subject to the restriction that $g_{K_n}(x, \hat{\theta})$ is a member of the estimation space $\mathcal{G}$. 
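
The least squares criterion described above can be written down directly. The following is a minimal sketch, assuming a logistic activation and a small hypothetical network; the names `g_K`, `s_n`, `theta_true`, and `theta_off` are illustrative, the data are noiseless by construction, and the restriction to the estimation space is deliberately not imposed here.

```python
import math

def logistic(u):
    """Assumed hidden unit activation G."""
    return 1.0 / (1.0 + math.exp(-u))

def g_K(x, theta):
    """Network output; theta is a list of (beta_j, gamma_j) pairs and
    x already carries the bias unit as its first entry."""
    return sum(beta * logistic(sum(g * xi for g, xi in zip(gamma, x)))
               for beta, gamma in theta)

def s_n(theta, data):
    """Average squared residual over the training set [(y_t, x_t), ...]."""
    return sum((y - g_K(x, theta)) ** 2 for y, x in data) / len(data)

# Hypothetical two-unit network used to generate noiseless training data.
theta_true = [(1.5, [0.2, -1.0]), (-0.8, [-0.5, 2.0])]
inputs = [[1.0, -1.0 + 0.1 * t] for t in range(21)]  # leading 1.0 is the bias unit
data = [(g_K(x, theta_true), x) for x in inputs]

# A perturbed parameter vector gives a strictly larger criterion value.
theta_off = [(1.4, [0.2, -1.0]), (-0.8, [-0.5, 2.0])]
loss_true = s_n(theta_true, data)
loss_off = s_n(theta_off, data)
```

In practice the criterion would be handed to a nonlinear least squares routine, with the number of hidden units growing with the sample size and the norm bound enforced; none of that is shown in this sketch.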
REGULARITY CONDITIONS: 

Input space. The input space $X$ is the closure of a bounded, open subset of $\mathbb{R}^r$. 

Parameter space. For some integer $m$, $0 \le m < \infty$, some integer $p$, $1 \le p < \infty$, and 
some bound $B$, $0 < B < \infty$, $g^*$ is a point in the Sobolev space $W_{m+[r/p]+1,p,X}$ with 
$\|g^*\|_{m+[r/p]+1,p,X} < B$. 

Activation function. The activation function $G$ belongs to $C^m(\mathbb{R})$ and 
$\int_{-\infty}^{\infty} |(d^m/du^m) G(u)| \, du < \infty$. See Section 3 of Hornik, Stinchcombe and White 
(1989). 

Estimation space. $g_{K_n}(x, \theta)$ is restricted to $\mathcal{G} = \{ g : \|g\|_{m+[r/p]+1,p,X} \le B \}$ in 
the optimization of $s_n(g)$. 

Training set. The empirical distribution of $\{x_t\}_{t=1}^{n}$ converges to a distribution 
$\mu(x)$ and $\mu(O) > 0$ for every open subset $O$ of $X$. 

Error process. The errors $\{e_t\}$ are independently and identically distributed with 
common probability law $P$ having $\int e \, P(de) = 0$ and $0 \le \int e^2 \, P(de) < \infty$. 
($\int e^2 \, P(de) = 0$ implies $e_t = 0$ for all $t$.) 

Independence. The probability law $P$ of the errors does not depend on $\{x_t\}_{t=1}^{\infty}$; 
that is, $P(A)$ can be evaluated without knowledge of $\{x_t\}_{t=1}^{\infty}$, 
$\lim_{n \to \infty} (1/n) \sum_{t=1}^{n} x_t$, etc. 





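THEOREM 1. Under the Regularity Conditions 

$$\lim_{n \to \infty} \| g^* - g_{K_n}(\cdot, \hat{\theta}) \|_{m,\infty,X} = 0 \quad \text{almost surely}$$

provided $\lim_{n \to \infty} K_n = \infty$ almost surely. In particular, 

$$\lim_{n \to \infty} \sigma[ g_{K_n}(\cdot, \hat{\theta}) ] = \sigma(g^*) \quad \text{almost surely}$$

provided $\sigma$ is continuous with respect to $\| \cdot \|_{m,\infty,X}$. □ 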
4. EXAMPLE: LEARNING CHAOTIC MAP 


Our investigation studies the ability of the single hidden layer network 

$$g_K(x_{t-5}, \ldots, x_{t-1}) = \sum_{j=1}^{K} \beta_j G(\gamma_{5j} x_{t-5} + \cdots + \gamma_{1j} x_{t-1} + \gamma_{0j})$$

with logistic squasher 

$$G(u) = 1/[1 + \exp(-u)]$$

to approximate the derivatives of a discretized variant of the Mackey-Glass 
equation (Schuster, 1988, p. 120) 

$$g(x_{t-5}, x_{t-1}) = x_{t-1} + (10.5) \left[ \frac{(0.2) x_{t-5}}{1 + (x_{t-5})^{10}} - (0.1) x_{t-1} \right].$$

The values of the weights $\beta_j$ and $\gamma_{ij}$ that minimize 

$$s_n(g_K) = \frac{1}{n} \sum_{t=1}^{n} [ x_t - g_K(x_{t-5}, \ldots, x_{t-1}) ]^2$$

were determined using the Gauss-Newton nonlinear least squares algorithm. Our 
rule relating $K$ to $n$ was of the form $K \propto \log(n)$ because asymptotic theory in a 
related context (Gallant, 1989) suggests that this is likely to be the relationship 
that will give stable estimates. 
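
For readers who want to reproduce the target of this example, the sketch below iterates the discretized Mackey-Glass map and evaluates its two partial derivatives in closed form. It is an illustration under stated assumptions: the map is taken to be x_t = x_{t-1} + 10.5 [0.2 x_{t-5} / (1 + x_{t-5}^10) - 0.1 x_{t-1}], the starting values in `simulate` are hypothetical (the source does not give them), and all function names are illustrative. It does not fit a network.

```python
def g(x_lag5, x_lag1):
    """Discretized Mackey-Glass map in terms of x_{t-5} and x_{t-1}."""
    return x_lag1 + 10.5 * (0.2 * x_lag5 / (1.0 + x_lag5 ** 10) - 0.1 * x_lag1)

def dg_dlag5(x_lag5):
    """Partial derivative of g with respect to x_{t-5} (quotient rule)."""
    a10 = x_lag5 ** 10
    return 2.1 * (1.0 - 9.0 * a10) / (1.0 + a10) ** 2

def dg_dlag1():
    """Partial derivative of g with respect to x_{t-1}; constant in x."""
    return 1.0 - 10.5 * 0.1

def simulate(n, start=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Iterate the map n - 5 times from five assumed starting values."""
    series = list(start)
    for t in range(5, n):
        series.append(g(series[t - 5], series[t - 1]))
    return series

series = simulate(500)

# Central finite difference as an independent check of dg_dlag5.
h = 1e-6
a = 0.9
fd = (g(a + h, 0.3) - g(a - h, 0.3)) / (2 * h)
derivative_error = abs(fd - dg_dlag5(a))
```

Training pairs for a network would then consist of the lagged values (x_{t-5}, ..., x_{t-1}) as inputs and x_t as the target, with `dg_dlag5` and `dg_dlag1` serving as the reference derivatives against which the fitted network's derivatives are compared.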
Figure 1. Superimposed nonlinear map and neural net estimate 